Spatiotemporal distribution, trend, forecast, and influencing factors of transboundary and local air pollutants in Nagasaki Prefecture, Japan

The study of PM2.5 and NO2 has been emphasized in recent years due to their adverse effects on public health. To better understand these pollutants, many studies have researched the spatiotemporal distribution, trend, forecast, or influencing factors of these pollutants. However, rarely studies have combined these to generate a more holistic understanding that can be used to assess air pollution and implement more effective strategies. In this study, we analyze the spatiotemporal distribution, trend, forecast, and factors influencing PM2.5 and NO2 in Nagasaki Prefecture by using ordinary kriging, pearson's correlation, random forest, mann–kendall, auto-regressive integrated moving average and error trend and seasonal models. The results indicated that PM2.5, due to its long-range transport properties, has a more substantial spatiotemporal variation and affects larger areas in comparison to NO2, which is a local pollutant. Despite tri-national efforts, local regulations and legislation have been effective in reducing NO2 concentration but less effective in reducing PM2.5. This multi-method approach provides a holistic understanding of PM2.5 and NO2 pollution in Nagasaki prefecture, which can aid in implementing more effective pollution management strategies. It can also be implemented in other regions where studies have only focused on one of the aspects of air pollution and where a holistic understanding of air pollution is lacking.

www.nature.com/scientificreports/ factors had a strong positive and negative correlation, with others having a weak positive and negative correlation. Average temperature, maximum temperature, minimum temperature, average humidity, and minimum humidity had a strong negative correlation, with average local pressure and average sea level pressure having a strong correlation. On the other hand, the southern oscillation index, rain maximum 10 min, average wind speed, and sunlight time had a weak negative correlation, with maximum wind speed and maximum instantaneous wind speed having a weak positive correlation. The other factors had mixed results, with some stations having positive, negative, or no correlation.   The RF model feature selection results at Tsushima, Goto, Daitou, Inasa, Obama, and Yoshii for PM 2.5 and NO 2 are shown in (Fig. 5). These stations were selected based on their location, which are representative of the study area. Among the monitoring stations, the most important predicting factors for PM 2.5 were Spring, maximin instantaneous wind speed one day, maximum instantaneous wind speed, humidity, sunlight time one day, and southern oscillation index. At the observation tower of Tsushima, Goto, Inasa, and Yoshii, Spring is among the three major predicting factors of PM 2.5, with Spring being the primary predictor at Goto. Maximum instantaneous wind speed was among the three most important predictors for Goto, Daitou, Insas, and Yoshii and was the primary predictor at Tsushima. Maximum instantaneous wind speed was the main predictor for Daitou, Inasa, and Yoshii and the second most important predictor for Obama. Humidity was the main predictor of Obama and the third predictor at Goto. Southern oscillation index was only a significant predictor at Tsushima.
On the other hand, for NO 2, the three most important predictors that were among the stations were average wind speed, minimum temperature one day, maximum instantaneous wind speed, average sea level pressure one day, average temperature one day, average sea level pressure three days, southern oscillation index seven days, average temperature, maximum temperature one day and average local pressure one day. Average wind speed was among the three main predictors in Tsushima, Goto, Daitou, Inasa, and Yoshii, and the main predictor in Goto. Minimum temperature was the major predictor in Tsushima and Daitou. In Goto and Daitou, maximum instantaneous wind speed was the second and third major predictor, respectively. Sea level pressure one day was the main predictor in Inasa and Yoshii. Average temperature one day, average sea level pressure three days, southern oscillation index seven days, average temperature, maximum temperature one day, and average local pressure one day were among the three most important factors in only one of the stations. The feature selection result from RF indicated that the factors influencing PM 2.5 and NO 2 concentrations in Nagasaki Prefecture vary depending on the location of the monitoring stations. Tables 3 and 4 show the results of the random forest models for each of the 18 stations for PM 2.5 and NO 2 , respectively. Model accuracy was evaluated using R 2 and MSE. In the case of PM 2.5 , the result indicated that the accuracy estimates for the 18 stations are varied with R 2 values in the range of 0.41-0.53 and MSE of 22.7-37.6 for the training dataset and R 2 values ranging from 0.16 to 0.33 and MSE 32.7-51.3 for the test dataset. The low values of R 2 , high values for MSE, and the high difference of R 2 between the trained model and the test model indicated that the RF models constructed with these factors could not be used to predict PM 2.5 concentrations.
Whereas, in the case of NO 2 , R 2 was higher (Test: 0.354-0.735) than the R 2 values of PM 2.5 ; thus, the factors used in this study are a better predictor of NO 2 concentration than PM 2.5 . However, the R 2 values for most of the NO 2 test models are still low to be used to predict NO 2 concentrations, except for the results of Isahaya, which can be considered acceptable.
Trend and Forecast analysis. At a 0.05 significance level, the Mann-Kendall test determined that PM 2.5 and NO 2 in most of the monitoring stations had a monotonic trend and a negative slope ( Table 5). The stations with no monotonic trend for PM 2.5 were Shimabara, Oomura, Kawadana, Togitsue, MatsuuraShimachi, and Tsushima, and for NO 2 were Yukiura, Tsushima, Iki, Obama, and Muramatsu. The stations that had the most significant magnitude of reduction for PM 2.5 were Daitou (− 1.278 μg/m 3 ), Fukuishi Jihai (− 1.178 μg/m 3 ), and Kogakura (− 1.01 μg/m 3 ), while Goto (− 0.43 μg/m 3 ) had the lowest. For NO 2 , the most significant magnitude of reduction was observed in Fukuishi Jihai (− 0.78 ppb), Higashi Nagasaki (− 0.50 ppb), and Kogakura (− 0.49 ppb), and the lowest in Matsuura Shisamachi (− 0.12 ppb). Figures S1 and S2 represent the data decomposition and tend for six monitoring stations.
The results of Holt-Winters and ARIMA forecast analysis are presented in Table 5 Figure 6 shows the forecast results for PM 2.5 and NO 2 for six monitoring stations using the best model, ETS, Holt-Winters, or ARIMA (Table 5). PM 2.5 for Tsushima, Inasa, Yoshii and Obama was forecasted with ARIMA, while Goto and Daitou were forecasted with Holt winters. The forecast of PM 2.5 produced by ARIMA in Inasa   www.nature.com/scientificreports/ For PM 2.5 , in general, the future forecast indicates a negative trend. However, the future concentration of PM 2.5 will remain above the 2021 WHO recommendations (5 μg/m 3 ). Also, in most stations, the future concentrations will stay above or below the 2005 WHO recommendations (10 μg/m 3 ), depending on the season, except for Daitou, which shows that the future concentrations will decline below the 2005 recommendations. For NO 2 , the forecast shows a very slight declining trend for the majority of the stations. Compared to the other stations in Yoshii, the decline is more consistent. For NO 2 , all the future forecasts are below the 2005 WHO recommendations (21 ppb). Also, for most stations, the NO 2 concentration will be below or above the 2021 WHO recommendations (5 ppb), depending on the season. Yoshii and Obama are exceptions, as NO 2 concentration levels are below the 2021 recommendations.

Discussion
Acknowledging that air pollution has adverse health effects, even at the lowest observed levels, is crucial for reconsidering current legislation and regulation. Thus, reducing the health impacts caused by the average annual exposure to NO 2 and PM 2.5 needs to be prioritized to address known inequities owing to economic activities, socioeconomic conditions, and increased vulnerability of the residential population 8 . Although regulation and legislation in Japan have effectively reduced PM 2.5 and NO 2 concentrations over the years, and it is considered The difficulty of regulating and reducing PM 2.5 concentration in Nagasaki Prefecture is due to the long-range transport of PM 2.5 from East Asia and Eurasia 28 . As a result of the long-range transport characteristic of PM 2.5 , its spatial distribution and concentration vary throughout the study period as it is affected by climatic and temporal factors. For instance, Pearson's correlation and the random forest feature selection indicated that the most important factors influencing PM 2.5 were Spring, maximum instantaneous wind speed, humidity, sunlight time one day, and southern oscillation index. In Spring, PM 2.5 concentrations are higher than in other seasons (Table S2). This is due to the changes in meteorological   29 . This result is further reinforced by Fig. 2, which suggests that from 2014 through 2021, the highest concentration of PM 2.5 are located in the westernmost part of Nagasaki Prefecture. Several studies have indicated that PM 2.5 is transported to Nagasaki Prefecture from East Asia; thus, the proximity of the westernmost part of Nagasaki's Prefecture to East Asia, its downwind location, and the change of wind direction in Spring are the main reasons for high PM 2.5 concentrations detected during the study period. The less affected areas are those located in the easternmost part, which is further away from East Asia. The wide distribution of PM 2.5 and its spatial variability thought the study period makes it difficult to regulate and identify specific hotspots. Its wide distribution is also a cause for concern as it has health implications for many of the resident population in Nagasaki Prefecture. However, during the study period, as indicated by Table 2, Figs. 2, 3, and 5, there has been a decline in PM 2.5 concentrations. This decline in PM 2.5 in Nagasaki Prefecture is related to the decrease in PM 2.5 concentrations in China and Korea. This reduction can be attributed to the changes in policy, technology, social, environmental, and economic factors in Japan, Korea, and China. For instance, the changes in environmental policies and the tri-national cooperation between these countries have generated positive results in reducing PM 2.5 (see Section "Policy background" Table 1). Also, the restrictions on social and economic activities imposed due to the COVID-19 pandemic resulted in a notable reduction of PM 2.5 in 2020 and 2021. Although PM 2.5 shows a declining trend, better local and regional strategies are needed to reduce PM 2.5 further as the pollution levels are above the WHO guidelines amidst local and tri-national efforts.
As for NO 2 , the average annual concentration is below the 2005 pollution guidelines. However, the results indicated that the hotspots identified are above the WHO 2021 pollution concentration guidelines. NO 2 pollution concentrations are also influenced by climatic and temporal factors, as indicated by Pearson's correlation and random forest feature selection analysis. For NO 2 , average wind speed was negatively correlated due to the dilution and dispersion of pollutants. However, maximum wind speed and maximum instantaneous windspeed were positively correlated, which can be attributed to the notion that the NO 2 plum is buoyant, but at higher wind speeds, the plum is brought down to ground level 30 . Temperature was negatively correlated with NO 2 ; temperature is known to promote air convection, leading to pollution dispersion and dilution 31 . Average local pressure and average sea level pressure were positively correlated due to the low atmospheric boundary layer, which accompanies high pressure and prevents air pollutants' vertical dispersion 29 . Sunlight time was negatively correlated to NO 2 ; this can be attributed to the photochemical reactions of solar radiation, which reduced NO 2 concentration. Amidst the influence of climatic factors on NO 2, its spatial distribution remained constant throughout the study period with consistent hotspot areas, except for 2020, where the pollution concertation was the lowest and more dispersed with no visible hotspot. From 2013 to 2019 and 2021, hotspots were located in Nagasaki's Prefecture major cities, Nagasaki, Sasebo, Isahaya, and Oomura. Nagasaki and Sasebo are the two largest cities in Nagasaki Prefecture with the highest concentrations of NO 2 throughout the study period; this is because of the economic activities in the area associated with shipbuilding, power plants, machinery, and heavy industries and also the burning of fossil fuels, especially from the transport sector. The lowest concentrations of NO 2 were in 2020-2021, as indicated by Table 2 and Fig. 3; this remarkable reduction of NO 2 can be attributed to the restrictions imposed by the Japanese government on social and economic activities due to the COVID-19 pandemic. The reduction of NO 2 in 2018-2019 could be due to the decommissioning of the Ainoura Power Station, a crude oil-fired power  22 and also the regulation of emissions from stationary sources such as fossil fuel powerplants, electric and industrial boilers. Pearson's correlation and the random forest feature selection identified major factors influencing PM 2.5 and NO 2 and provided a good indication of the complex relationship between the significant climatic and temporal factors and PM 2.5 and NO 2 pollutants in Nagasaki Prefecture. However, the results indicate that the correlation and factors of importance that influence PM 2.5 and NO 2 vary depending on the monitoring station. These differences observed in terms of correlation, factors of importance, tend, and model performance among the 18 stations can be attributed to the varying unique characteristics of climatic, environmental, social, and economic factors in each location, which affect PM 2.5 and NO 2 concentrations. For instance, in the case of Goto , the major predictor of PM 2.5 is Spring and humidity in Obama. This difference can be attributed to the location of these monitoring stations. Goto is located in the westernmost part of Nagasaki Prefecture, which is the area most affected by the long-range transport of PM 2.5 from East Asia in Spring, as opposed to Obama, which is located in the easternmost part of Nagasaki Prefecture, which is the least affected by the seasonal changes. Although RF www.nature.com/scientificreports/ affecting the spatiotemporal distribution and trend of PM 2.5 and NO 2 in Nagasaki Prefecture. And also to assess the differences that exist (e.g., trend, influencing factors, etc.) among the monitoring stations.

Materials and methods
Study site. Nagasaki Prefecture is located on the island of Kyushu (Fig. 7). The prefecture has an area of approximately 4,105 km 2 with a population of 1,377,187. Nagasaki borders Saga Prefecture on the east and is surrounded by the Tsushima Straits, the Ariake Bay, and the East China Sea. Nagasaki air pollution is relatively low but is influenced by transboundary air pollution from Asia and Eurasia 28,33,34 . Studies conducted in Nagasaki have demonstrated that air pollution has adverse health effects, especially in children 14,16 . Moreover, 8.3 and 29.6% of the population in Nagasaki Prefecture are less than or equal to 10 and more than or equal to 65 years of age, respectively. Therefore, they are considered vulnerable to air pollutants 35  Ordinary kriging. Ordinary kriging (OK) interpolation is suitable for PM 2.5 and NO 2 concentration mapping as it is a commonly used geostatistical estimator in air pollution interpolation and is often referred to as the unbiased estimator 36 . Ordinary kriging models the unsampled value z*(x 0 ) as a combination of neighboring observations n, Eq. (1), 37 : where z*(x 0 ) estimate value at x 0 , Z(x i ) measure value at x i and λ i weight is assigned for the residual of Z(x i ).
Semivariogram. We derived the experimental semi-variogram for the 18 datasets to determine the spatial autocorrelation and the spatial structure of data points. The semi-variograms are expressed as a function of the distance between data points and explain the measured points' spatial relationship, Eq. (2), 38 .
where γ(h) quantity function of increment h, N(h) numbers of pairs separated by the vector h, Z(x i ) is the sampled values at location x i and Z(x i + h) sampled measurements at location X i + h. In this study, we fitted the experimental semi-variogram to two theoretical semi-variogram models: exponential and spherical, two of the most commonly used models 39 . The parameters determined were: range (a) the distance up until which the regionalized variable is auto-correlated, partial sill (c) which is the spatially structured part of the residuals, and the nugget (c 0 ) the non-spatial variability 40 . The spherical and exponential models are defined by Eqs. (1) www.nature.com/scientificreports/ Cross validation. The model's prediction ability of the unsampled PM 2.5 and NO 2 locations was conducted using cross-validations to calculate the mean error (ME), standard mean error (SME), root mean square error (RMSE), root mean square standard error (RMSSE) and average standard error (ASE). We analyzed the crossvalidation results for both spherical and exponential models; the model with better results was selected for interpolating PM 2.5 and NO 2 (Table S1). The RMSE and RMSSE are defined by Eqs. (5) and (6), respectively 42 .

Exponential Model
where N number of validation points, Z(x i ) measured value and Z*(x i ) standard values being Z 1 (x i ) and Z 2 (x i ).
A RMSE closer to 0 and a RMSSE closer to 1 depict that the parameters and fitting model are excellent and the kriging estimators are robust.
Pearson's correlation and random forest. Pearson's correlation analysis was conducted for each of the 18 monitoring stations to produce a heatmap depicting the correlation between major climatic and temporal factors and PM 2.5 and NO 2 pollutants (Table 6). We then used random forest (RF) to identify the most important climatic and temporal factors influencing PM 2.5 and NO 2 in each of the 18 stations 43 . The factors identified were then used to construct the RF models for each of the 18 stations to make PM 2.5 and NO 2 predictions. For each of the 18 stations, the random forest models were trained with 80% and validated with 20% of the respective monitoring station data. The RF model was then evaluated using the root mean square error (R 2 ) and the mean, standard error (MSE). Random forest modeling is a type of ensemble learning method used for classification and regression analysis. It is well known to have advantages in terms of accuracy, robustness, and computational efficiency compared to other models 44 . The RF model was constructed using the open-source machine learning library scikit-learn *2 on Python. Next, the categorical factors were converted into dummy or indicator factors using the Python Pandas method (get dummies) (Tables 6 and 7). Furthermore, 7-day time lag data was added to the climatic factors to confirm the influence of past dependent factors. Finally, hyperparameters were determined with the ranges and steps indicated in (Table 8) using the grid-search technique for optimal model construction.  where x j and x i time series and n number of data points in the time series. Where t p number of ties up to sample p. A positive Z value signifies a rising trend, a negative Z signifies a descending trend for the data period. The sens.slope function was used to calculate the Sen's slope which indicated the magnitude of the trend. Equation (8) gives the slope for all data pairs and Eq. (9) the median of the n values of T i , Sen's slope estimator (Q i ) 47 .
where Ti slope and x j and x k data values at time j and k.
A positive Q i signifies a rising trend; a negative Q i signifies a declining trend over time. Both Mann-Kendall and Sen's slope consider the seasonality of the data. The trend package in R was used to do the correlated seasonal Man-Kendall test and the seasonal Sen's slope tests 48 . Both functions do not operate on missing data; therefore, the tsclean function in the forecast package was used 49,50 . To obtain the trend of PM 2.5 and NO 2 , we decompose the time series data into a trend, seasonal and irregular components by using the stl (seasonal decomposition of time series by LOESS) function developed by William Cleveland 51,52 . The stl function from the stats package was used to fit the loess to the data and the tsclean function in the forecast package was used to identify and replace outliers and missing values before applying the stl function. Then the stl function from the stats package was used to fit the loess to the data. The mean absolute percentage error (MAPE) and root mean square error was computed to determine if the component after the LOESS decomposition had satisfactorily captured the PM 2.5 and NO 2 data information. The goodness of fit of the trend line was determined by checking the residuals; this was done by using the checkresiduals function from the forecast package.
Both exponential smoothing and ARIMA models were evaluated for the forecast analysis. These methods have been used to perform air pollution forecast analysis and, in some cases, have performed better than deep learning methods. First, the Augmented Dickey-Fuller test (ADF Test) was performed to ensure the stationarity of the time-series data 53 . Once stationarity was confirmed, the two models were trained and tested with the 2013-2019 and the 2020-2021 pollutants datasets, respectively. Next, validation was performed using the test set whereby the mean absolute percentage error (MAPE) and root mean square error were computed to determine if the EST and ARIMA had satisfactorily captured the information of the PM 2.5 and NO 2 data. The models with the lowest AIC were then used to do the forecasting of both PM 2.5 and NO 2 .
Exponential smoothing (ES) forecasting methods and models. Brown, Winter and Holt introduced the exponential smoothing 54 . Gardner 55 extensively reviews the various ES methods. The exponential smoothing forecasting formulation consists of the forecast method and the statistical model. The forecast method uses an algorithm to produce a point forecast which is a prediction of a single value whereas the forecast statistical model is a process which generated an entire probability distribution with several values which when averaged generates a point forecast and provides prediction intervals with a level of confidence 54 .
The exponential smoothing forecasting method is based on the idea that the forecast produced are weighted averages of past observations, with the weight associated to each observation exponentially decreasing as the n is odd www.nature.com/scientificreports/ observation gets older 54,55 . Model formulations are of component (recursive) form and error correction form 54 . The error correction form is derived from the rearrangement of the equation in the component form. This error correction form uses the state space approach to exponential smoothing method since it consists of a measurement (observed) equation and a state (transition) equation. These two equations with their error distribution constitute a specified statistical model know as state space model. Since all observations and state variables uses the same error process it is called "single source of error" (SSOE) or "innovation" and more specifically known as "innovation state space model. The single source of error (SSOE) was formulated by Snyder 56 . Pegels provided classification of the trend and the seasonal patterns depending on whether they are additive (linear) or multiplicative (nonlinear) 56 . The family of exponential smoothing forecasting methods can be systematically described as a combination of level, trend, and seasonality 54,58,59 . Each one can be of either an additive character or multiplicative character. The trend component can be classified as having no trend, additive trend, additive damped trend, multiplicative trend, and multiplicative damped trend 55,[57][58][59] . The simplest classification is the single exponential smoothing (SES) method, which considers only the constant level model and uses data with no trend or seasonality. This method consists of a forecast and smoothing equations for the level. The Holt linear trend method, also known as Double Exponential Smoothing (DES), consist of a forecast equation, and two smoothing equations: a level equation and a trend equation.
The Holt-Winters seasonal method, also known as the triple exponential smoothing (TES), consist of a forecast equation and three smoothing equations: a level equation, trend equation, and seasonality equation. The family classification of exponential smoothing generates a combination of 15 exponential smoothing methods with different components 58,59 . Rearranging the terms in the different components for each of the 15 exponential smoothing methods (i.e., level component, trend component, and seasonal component), generate an error correction form model for each of the 15 methods with each having an additive or a multiplicative error model thus producing a total of 30 error models. These error-correction form models, also known as "innovative" state space models, are labeled as ETS ( ; ; ), representing Error, Trend, and Seasonal. The forecast equation is the measurement equation, and the smoothing equations becomes the state equation, with both having the same source of error 54 .
Of the 15 exponential smoothing methods, six were considered here. These are the Holts linear trend method, Holt linear damped trend, Holt-Winters additive, Holt-Winters additive damped, Holt-Winters multiplicative, and Holt-Winters multiplicative damped component. These six methods are converted to their error correction components form with their respective additive and multiplicative error correction model, yielding 12 error correction models. These 12 error correction models were used for model selection (Table S3).
The innovative state space model forecasting for each univariate time series was generated using the ets() function in the forecast package in R 49,58 . Two procedures using ets() function for model selection were used: the automatic selection and the manual selection of a model. The automatic selection of the ETS models provides options for which models to be evaluated and selects the most appropriate model given the data. The model option used was model = "ZAZ" where the first Z represents either additive or multiplicative error, the second Z represents automatic selection in which the choices are no seasonality, additive seasonality, or multiplicative seasonality, and the A represents an additive trend. Based on this "ZAZ" option, 12 models were evaluated from the 30 error models available. Selection of the best-fitted model from the 12 models was based on the minimization of the corrected Akaike Information Criterion (AIC), which avoids over-fitting by considering both goodnesses of fit and model complexity.
The manual model selection procedure used was selecting the hw() function from the forecast package in R. This function selects the Holt-Winters additive model, which corresponds to the ETS(A;A;A) in the ets () function which stands for additive error, additive trend, and additive seasonality 55 . The Ljung-Box Q test was used for residual diagnostics to determine whether the residuals were white-noise sequences. The Box.test was used from the stata package in R.
The ARIMA models. Slutsky, Walker, Yaglom, and Yule first articulated autoregressive (AR) and moving average (MA) models. Box & Jenkins integrated the existing knowledge formulating ARIMA, known as the Box-Jenkins approach 60 . An autoregressive model (AR) assumes the forecasted value is a linear combination of the past values of the variable, and moving average models (MA) assumes a linear combination of past forecasting errors. Combining these two models, AR and MA, produces an ARMA model. If the time series is non-stationary then the series are differenced to create stationarity before modeling, then the I is introduced in the ARMA. The I in an ARIMA model represents the integration parameter produced by differencing. The non-seasonal ARIMA models have parameters p, d, and q. The p represents the lag order of the autoregression, the d is the order of the differencing, and the q is the order of the moving average for the non-seasonal part. The seasonal ARIMA, also known as SARIMA, incorporates an additional set of terms, like the ARIMA models, that considers the seasonal effects. The seasonal parameters incorporated are P, D, Q, and m. The P, D, and Q represent the lag order of the autoregression, the order of the differencing, and the order of the moving average for the seasonal part, and the m represents the number of periods in each season. Box et al. and Chatfield 61,62 expressed the AR(p), MA(p), and ARMA (p, q), mixed seasonal ARMA(p,q)(P,Q) m , ARIMA(p,d,q), and mixed seasonal ARIMA -SARIMA(p,d,q) (P,D,Q) m models.
ARIMA forecast was done using the automatic ARIMA algorithm for model selection using the auto.arima() function from the forecast package in the R program 49,50 . Two procedures were used in the automatic selection. The first method used the default settings (restricted models), and the second was full model selection. Sometimes running the full model selection will produce a different optimal model 54 .